[chore] RFC: Count `PartialSuccess` in `exporterhelper` by braydonk · Pull Request #15280 · open-telemetry/opentelemetry-collector

braydonk · 2026-05-08T21:22:30Z

Description

This RFC is to introduce proper accounting for PartialSuccess accounting in a way that aligns with the definition of partial success in the OTLP spec.

Prior art: My first attempt to do this was in #14152, but I went back to the drawing board after realizing it was insufficient compared to the OTLP spec and there was some unrelated refactor to the obs_report_sender code.

Link to tracking issue

#13423
#14440

AI Usage Disclosure

Minor Antigravity usage in the linked branches in the Implementation Plan section. RFC itself was handwritten.

This RFC is to introduce proper accounting for `PartialSuccess` accounting in a way that aligns with the definition of partial success in the OTLP spec.

codspeed-hq · 2026-05-08T21:28:42Z

Merging this PR will improve performance by 48.81%

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️

Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 2 improved benchmarks
✅ 5 untouched benchmarks
⏩ 76 skipped benchmarks¹

Performance Changes

	Benchmark	`BASE`	`HEAD`	Efficiency
⚡	`BenchmarkMemoryQueueWaitForResult`	74.9 µs	50.3 µs	+48.81%
⚡	`BenchmarkPersistentQueue`	205.2 µs	148 µs	+38.67%

_{Comparing braydonk:rfc_partial_success (613d4b5) with main (91b32ef)}

76 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

codecov · 2026-05-08T21:34:59Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 91.23%. Comparing base (91b32ef) to head (613d4b5).

Additional details and impacted files

@@           Coverage Diff           @@
##             main   #15280   +/-   ##
=======================================
  Coverage   91.23%   91.23%           
=======================================
  Files         703      703           
  Lines       45902    45902           
=======================================
  Hits        41878    41878           
  Misses       2820     2820           
  Partials     1204     1204

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

dmitryax · 2026-05-08T22:05:01Z

+
+Two new error APIs will be introduced:
+
+### `Countable` error 


I'm not sure I understand why we need this one... What use cases we want to cover where this one should be used instead of PartialSuccess?

I suppose it's possible to reduce this just to PartialSuccessError, but since this is also going to be used by the prometheus receiver (see #14440) I thought the PartialSuccess error, which is so tied to exporter use cases, wouldn't make sense. Wanted to make it flexible so they could report any error they might want and simply wrap it in a count.

I don't see a reason why the Prometheus receiver couldn't use a partialsuccess error, as long as we can provide the failed telemetry count somehow.

dmitryax · 2026-05-08T22:20:51Z

+
+#### Why does it implement status code `OK`?
+
+A `PartialSuccess` is nominally a success according to the spec. As a result, when `exporterhelper` processes this it needs to know that the Go error it received doesn't actually correspond to a real failure to send data.This will allow accurate setting of the span status (it shouldn't be `Failed` upon `PartialSuccess`) and determining of `error.type` (a new custom type called `Partial_Success` would be introduced).


I think we need to clearly define if we want this error to be propagated by the exporter. If we agree that exporter clearly reports the failed portion as dropped in the internal telemetry and returns nil, we don't need to include GRPC status code resolution. However, in order to decide on this, we need to clearly outline how the internal telemetry will be affected in the exporter and upstream components (if we propagate)

I think the error needs to be propogated by the exporter for sure, I can't think of any nice way around that. This means that the GRPC status code resolution is then necessary for obs_report_sender.go to count the error with an appropriate error.type and recognize not to mark the span as failed.

The rest of the error API was designed such that the error could be worked with as any other permanent error when any other part of exporterhelper other than obs_report_sender.go receive it.

bogdandrutu · 2026-05-10T05:28:03Z

+
+Two new error APIs will be introduced:
+
+### `Countable` error 


Do we need to have this public surface?

This is an error that can any other error can be wrapped in along with a count of failures. This will allow for cases where producers of an error want to include an amount of failures.

what is the unit for "failures"? May not be bad to have this as "PartialError" and map 1:1 the PartialSuccess.

1$ question: In case of a connector (like traces to metrics) if we fail and get back that we failed to export 2 metrics. For components prior to the trace to metrics connector that number "2" does not mean "2 spans failed". How do we treat this case?

Do we need to have this public surface?

Do you mean does the interface need to be public? Not necessarily, I could just make a public function func GetErrorCount(err error) (errCount int, isCountable bool). I thought that was more awkward than the current API of getting the error as xconsumererror.Countable and then calling the Count method on it. But the former makes for less public surface.

what is the unit for "failures"?

In the error itself I intentionally exclude unit. The unit would be decided by whoever is consuming the error to record failures. In my unfinished otlp exporter draft this ends up working out based on the fact that the exporter itself has a signal type, so when obsReportSender is recording failures for a logging otlp exporter, that failed_sent metric already has a unit of {log}.

In case of a connector (like traces to metrics)

I hadn't considered connectors, since I was only thinking of exporters for this RFC. However I think most of the design will still fit in, but the big open question of how to record this for more complex scenarios like fanoutconsumer was blocking the ability to at least solve this for exporterhelper and receiverhelper, where the case is more clear-cut.

(I'm not sure I understand how error metrics are supposed to be recorded at all for fanoutconsumer scenarios, let alone being able to report partial counts, but I think if the error recording behaviour is clear for fanoutconsumer the partial part is relatively simple to slot in).

For components prior to the trace to metrics connector that number "2" does not mean "2 spans failed".

Are you talking about the scenario where something failed downstream from the metrics post-conversion? Assuming the simple scenario of 1-to-1 traces to metrics and no fanout, I see two scenarios:

The trace/span failed to be converted to a metric or was refused by the conversion process in some way. This seems pretty clear that "1" failed item is "1 failed span in conversion" and would be 1 refused span that obsconsumer could potentially record by recognizing xconsumererror.Countable here.

The metric failed post-conversion by something down its pipeline. I think it's essentially impossible to determine whether a failed metric in that pipeline somehow lines up with a particular span that was converted to a metric. I think it's fair to say that the tracestometrics connector did its job, and considers all spans successful, and if the metric fails after successful conversion then that's for components further down the line to properly report.

I wasn't part of any connector design discussions that happened or may be happening, so the above is based on my own conclusions by reading and not by any definitive explanation given to me.

bogdandrutu · 2026-05-10T05:30:06Z

+
+A `PartialSuccess` error will wrap an internal error as permanent, implement [GRPC status code resolution](https://pkg.go.dev/google.golang.org/grpc/status#FromError) to code `OK`, and wrap it all as a `Countable` with a count of failures.
+
+#### Why should the internal error be permanent?


@open-telemetry/technical-committee is this a correct read of the protocol definition? I personally think it is.

Yes.

However, I think the protocol specification can be improved/fixed to allow the service to provide more information regarding the partial success, which would allow the client to retry the items that failed to be delivered if these are retriable.

Yes, I think this is the correct read. The OTLP spec says:

The client MUST NOT retry the request when it receives a partial success response where the partial_success is populated.

The partial success response signals that some items are successfully received and some others are bad and cannot be received, so no retries are needed for either.

and some others are bad and cannot be received, so no retries are needed for either.

@tigrannajaryan What could be a reason for being bad? Why couldn't they be received? (We scenarios, where it's not acceptable to have kind of unknown failures)

What could be a reason for being bad?

Any violation of requirements defined in https://github.com/open-telemetry/opentelemetry-proto/tree/main/opentelemetry/proto could be the reason the data cannot be received.

bogdandrutu

Overall looks good, needs more polish on some edge cases.

jamesmoessis · 2026-05-13T06:17:54Z

+* Fully counting data drops throughout the pipeline
+    - The new `Countable` interface does open the possibility of counting partial failures at various points throughout the pipeline, which might be worth exploring down the line. The only thing I want to address right now is ensuring these are properly counted at the exporter metric level.
+* Partial retries
+    - This is purely covering scenarios that align with the OTLP spec's definition of `PartialSuccess`, which does not come with any functionality to determine exactly which items failed


Given that consumererror can currently be used to trigger partial retries in the retry_sender, won't it be necessary to continue to support both:

partial successes that can be partially retried and counted partially once retries fully fail

partial successes that can only be counted

Otherwise, does it mean we will have behaviour where consumererror.NewTraces(err, failedTraces) indicates partial failure but it isn't counted partially?

dashpole · 2026-05-14T16:01:06Z

+
+### `PartialSuccess` error
+
+A `PartialSuccess` error will wrap an internal error as permanent, implement [GRPC status code resolution](https://pkg.go.dev/google.golang.org/grpc/status#FromError) to code `OK`, and wrap it all as a `Countable` with a count of failures.


Why does it need to implement gRPC status code resolution? It looks like FromError looks for wrapped errors as well (using errors.As: https://github.com/grpc/grpc-go/blob/cb18228317ff523e63d931b4058b0329585b7dcd/status/status.go#L113). So if, for example, the OTLP exporter wants to provide the gRPC status code OK in its PartialSuccess error, it should just be able to use standard Go error wrapping to do so.

dashpole · 2026-05-14T16:03:40Z

+
+This API may end up being useful in other places than `PartialSuccess`, such as [in `receiverhelper`](https://github.com/open-telemetry/opentelemetry-collector/issues/14440).
+
+### `PartialSuccess` error


Is this error meant to specifically be the OTLP "PartialSuccess" error? Or is this meant to be a collector type that means "The operation succeeded overall, but some parts failed"? If it is OTLP-specific, it shouldn't be in exporterhelper, and shouldn't be used by other components. If it is a generic "this thing partially succeeded" error, then it shouldn't do things like set gRPC status codes, etc.

RFC: Count PartialSuccess in exporterhelper

613d4b5

This RFC is to introduce proper accounting for `PartialSuccess` accounting in a way that aligns with the definition of partial success in the OTLP spec.

braydonk requested review from a team, bogdandrutu, codeboten, dmitryax and mx-psi as code owners May 8, 2026 21:22

github-actions Bot added the rfc:approvals-needed This RFC needs approvals from collector-approvers label May 8, 2026

braydonk changed the title ~~RFC: Count PartialSuccess in exporterhelper~~ [chore] RFC: Count PartialSuccess in exporterhelper May 8, 2026

dmitryax reviewed May 8, 2026

View reviewed changes

bogdandrutu reviewed May 10, 2026

View reviewed changes

Comment thread docs/rfcs/partial-success-counting.md

bogdandrutu reviewed May 10, 2026

View reviewed changes

dmitryax mentioned this pull request May 10, 2026

[exporter/exporterhelper] Add DroppedItemsErr to report intentionally dropped items #15028

Open

jamesmoessis reviewed May 13, 2026

View reviewed changes

dashpole reviewed May 14, 2026

View reviewed changes


		#### Why does it implement status code `OK`?

		A `PartialSuccess` is nominally a success according to the spec. As a result, when `exporterhelper` processes this it needs to know that the Go error it received doesn't actually correspond to a real failure to send data.This will allow accurate setting of the span status (it shouldn't be `Failed` upon `PartialSuccess`) and determining of `error.type` (a new custom type called `Partial_Success` would be introduced).


		A `PartialSuccess` error will wrap an internal error as permanent, implement [GRPC status code resolution](https://pkg.go.dev/google.golang.org/grpc/status#FromError) to code `OK`, and wrap it all as a `Countable` with a count of failures.

		#### Why should the internal error be permanent?


		### `PartialSuccess` error

		A `PartialSuccess` error will wrap an internal error as permanent, implement [GRPC status code resolution](https://pkg.go.dev/google.golang.org/grpc/status#FromError) to code `OK`, and wrap it all as a `Countable` with a count of failures.


		This API may end up being useful in other places than `PartialSuccess`, such as [in `receiverhelper`](https://github.com/open-telemetry/opentelemetry-collector/issues/14440).

		### `PartialSuccess` error

Conversation

braydonk commented May 8, 2026

Description

Link to tracking issue

AI Usage Disclosure

Uh oh!

codspeed-hq Bot commented May 8, 2026

Merging this PR will improve performance by 48.81%

Performance Changes

Footnotes

Uh oh!

codecov Bot commented May 8, 2026

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

braydonk May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

braydonk May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

braydonk May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hilmarf May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bogdandrutu left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

braydonk May 11, 2026 •

edited

Loading

braydonk May 11, 2026 •

edited

Loading

braydonk May 11, 2026 •

edited

Loading

hilmarf May 12, 2026 •

edited

Loading